Recursos

Introducción a los modelos de pronósticos en series de tiempo

Intro

Cowpertwait, P. S. P., & Metcalfe, A. V. Introductory Time Series with R. Springer.

  • In most branches of science, engineering, and commerce, there are variables measured sequentially in time.

  • Observations that have been collected over fixed sampling intervals form a historical time series.

  • In statistics, historical series are treated as realisations of sequences of random variables.

  • The main features of many time series are trends and seasonal variations that can be modelled deterministically with mathematical functions of time. But, another important feature of most time series is that observations close together in time tend to be correlated (serially dependent).

  • Much of the methodology in a time series analysis is aimed at explaining this correlation and the main features in the data using appropriate statistical models and descriptive methods.

  • Once a good model is found and fitted to data, the analyst can use the model to forecast future values, or generate simulations, to guide planning decisions.

Sampling

  • Sampling intervals differ in their relation to the data.

  • If data are sampled, the sampling interval must be short enough for the time series to provide a very close approximation to the original continuous signal when it is interpolated.

Características de la Serie de Tiempo

library(tsibble)      # as_tsibble()
library(ggplot2)
library(dplyr)
library(lubridate)
library(hrbrthemes)   # theme_ipsum()

data(AirPassengers)
AP <- AirPassengers

plot(AP, ylab = "Passengers (1000's)")

AP %>% 
  as_tsibble() %>% 
  ggplot(aes(index, value)) +
    geom_line(color='#00B2A8') +
    theme_ipsum()

  • In general, a systematic change in a time series that does not appear to be periodic is known as a trend. The simplest model for a trend is a linear increase or decrease, and this is often an adequate approximation.

  • A repeating pattern within each year is known as seasonal variation, although the term is applied more generally to repeating patterns within any fixed period, such as restaurant bookings on different days of the week.

  • Sometimes we may claim there are cycles in a time series that do not correspond to some fixed natural period; examples may include business cycles or climatic oscillations such as El Niño.

  • A time series plot not only emphasises patterns and features of the data but can also expose outliers and erroneous values. One cause of the latter is that missing data are sometimes coded using a negative value. Such values need to be handled differently in the analysis and must not be included as observations when fitting a model to data.

AP %>% 
  as_tsibble() %>% 
  index_by(year = ~ year(.)) %>% 
  summarise(passengers = sum(value)) %>% 
  ggplot(aes(year, passengers)) +
  geom_line(color='#00B2A8') +
  theme_ipsum()


AP %>% 
  as_tsibble() %>% 
  ggplot(aes(index, value)) +
    geom_line(color='#00B2A8') +
    geom_smooth(alpha=0.7, colour = '#1C2D44', se=F, method='loess', linetype=3) +
    theme_ipsum()

AP_month <-
  AP %>% 
  as_tsibble() %>% 
  index_by(month = ~ lubridate::month(., label=F)) %>% 
  summarise(passengers = sum(value)) %>% 
  mutate(m = lubridate::month(month, label=T)) %>% 
  ungroup()

AP_month %>% 
  ggplot(aes(x=month, y=passengers)) +
  geom_line(color='#00B2A8') +
  theme_ipsum() +
  scale_x_discrete(limits = unique(AP_month$m), labels=unique(AP_month$m))
  
AP %>% 
  as_tsibble() %>% 
 # index_by(month = ~ lubridate::month(., label=T)) %>% 
#  summarise(passengers = sum(value)) %>% 
  mutate(month=month(index, label=T)) %>% 
  ggplot(aes(month, value, group=month, color=month, fill=month)) +
  geom_boxplot(stat = "boxplot",position = "dodge2") +
  theme_ipsum() +
  theme(
    legend.position = 'none'
  )

Decomposition of series and notation

  • We represent a time series of length n by {\(x_{t} : t = 1, . . . , n\)} = {\(x_{1}, x_{2}, . . . , x_{n}\)}.

  • It consists of n values sampled at discrete times 1, 2, . . . , n.

  • We write {\(x_{t}\)} when the length \(n\) of the series does not need to be specified.

  • An overline is used for sample means: \[\overline{x} = \sum_{t=1}^{n} x_{t}/n\]

  • The ‘hat’ notation will be used to represent a prediction or forecast. For example, with the series \({x_{t} : t = 1, . . . , n}\), \(\hat{x}_{t+k|t}\) is a forecast made at time t for a future value at time t + k. A forecast is a predicted future value, and the number of time steps into the future is the lead time (k).

decomposed_AP <- decompose(AirPassengers,type = 'multiplicative')

decomposed_AP
## $x
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
## 
## $seasonal
##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 1949 0.9102304 0.8836253 1.0073663 0.9759060 0.9813780 1.1127758 1.2265555
## 1950 0.9102304 0.8836253 1.0073663 0.9759060 0.9813780 1.1127758 1.2265555
## 1951 0.9102304 0.8836253 1.0073663 0.9759060 0.9813780 1.1127758 1.2265555
## 1952 0.9102304 0.8836253 1.0073663 0.9759060 0.9813780 1.1127758 1.2265555
## 1953 0.9102304 0.8836253 1.0073663 0.9759060 0.9813780 1.1127758 1.2265555
## 1954 0.9102304 0.8836253 1.0073663 0.9759060 0.9813780 1.1127758 1.2265555
## 1955 0.9102304 0.8836253 1.0073663 0.9759060 0.9813780 1.1127758 1.2265555
## 1956 0.9102304 0.8836253 1.0073663 0.9759060 0.9813780 1.1127758 1.2265555
## 1957 0.9102304 0.8836253 1.0073663 0.9759060 0.9813780 1.1127758 1.2265555
## 1958 0.9102304 0.8836253 1.0073663 0.9759060 0.9813780 1.1127758 1.2265555
## 1959 0.9102304 0.8836253 1.0073663 0.9759060 0.9813780 1.1127758 1.2265555
## 1960 0.9102304 0.8836253 1.0073663 0.9759060 0.9813780 1.1127758 1.2265555
##            Aug       Sep       Oct       Nov       Dec
## 1949 1.2199110 1.0604919 0.9217572 0.8011781 0.8988244
## 1950 1.2199110 1.0604919 0.9217572 0.8011781 0.8988244
## 1951 1.2199110 1.0604919 0.9217572 0.8011781 0.8988244
## 1952 1.2199110 1.0604919 0.9217572 0.8011781 0.8988244
## 1953 1.2199110 1.0604919 0.9217572 0.8011781 0.8988244
## 1954 1.2199110 1.0604919 0.9217572 0.8011781 0.8988244
## 1955 1.2199110 1.0604919 0.9217572 0.8011781 0.8988244
## 1956 1.2199110 1.0604919 0.9217572 0.8011781 0.8988244
## 1957 1.2199110 1.0604919 0.9217572 0.8011781 0.8988244
## 1958 1.2199110 1.0604919 0.9217572 0.8011781 0.8988244
## 1959 1.2199110 1.0604919 0.9217572 0.8011781 0.8988244
## 1960 1.2199110 1.0604919 0.9217572 0.8011781 0.8988244
## 
## $trend
##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 1949       NA       NA       NA       NA       NA       NA 126.7917 127.2500
## 1950 131.2500 133.0833 134.9167 136.4167 137.4167 138.7500 140.9167 143.1667
## 1951 157.1250 159.5417 161.8333 164.1250 166.6667 169.0833 171.2500 173.5833
## 1952 183.1250 186.2083 189.0417 191.2917 193.5833 195.8333 198.0417 199.7500
## 1953 215.8333 218.5000 220.9167 222.9167 224.0833 224.7083 225.3333 225.3333
## 1954 228.0000 230.4583 232.2500 233.9167 235.6250 237.7500 240.5000 243.9583
## 1955 261.8333 266.6667 271.1250 275.2083 278.5000 281.9583 285.7500 289.3333
## 1956 309.9583 314.4167 318.6250 321.7500 324.5000 327.0833 329.5417 331.8333
## 1957 348.2500 353.0000 357.6250 361.3750 364.5000 367.1667 369.4583 371.2083
## 1958 375.2500 377.9167 379.5000 380.0000 380.7083 380.9583 381.8333 383.6667
## 1959 402.5417 407.1667 411.8750 416.3333 420.5000 425.5000 430.7083 435.1250
## 1960 456.3333 461.3750 465.2083 469.3333 472.7500 475.0417       NA       NA
##           Sep      Oct      Nov      Dec
## 1949 127.9583 128.5833 129.0000 129.7500
## 1950 145.7083 148.4167 151.5417 154.7083
## 1951 175.4583 176.8333 178.0417 180.1667
## 1952 202.2083 206.2500 210.4167 213.3750
## 1953 224.9583 224.5833 224.4583 225.5417
## 1954 247.1667 250.2500 253.5000 257.1250
## 1955 293.2500 297.1667 301.0000 305.4583
## 1956 334.4583 337.5417 340.5417 344.0833
## 1957 372.1667 372.4167 372.7500 373.6250
## 1958 386.5000 390.3333 394.7083 398.6250
## 1959 437.7083 440.9583 445.8333 450.6250
## 1960       NA       NA       NA       NA
## 
## $random
##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 1949        NA        NA        NA        NA        NA        NA 0.9516643
## 1950 0.9626030 1.0714668 1.0374474 1.0140476 0.9269030 0.9650406 0.9835566
## 1951 1.0138446 1.0640180 1.0918541 1.0176651 1.0515825 0.9460444 0.9474041
## 1952 1.0258814 1.0939696 1.0134734 0.9695596 0.9632673 1.0003735 0.9468562
## 1953 0.9976684 1.0151646 1.0604644 1.0802327 1.0413329 0.9718056 0.9551933
## 1954 0.9829785 0.9232032 1.0044417 0.9943899 1.0119479 0.9978740 1.0237753
## 1955 1.0154046 0.9888241 0.9775844 1.0015732 0.9878755 1.0039635 1.0385512
## 1956 1.0066157 0.9970250 0.9876248 0.9968224 0.9985644 1.0275560 1.0217685
## 1957 0.9937293 0.9649918 0.9881769 0.9867637 0.9924177 1.0328601 1.0261250
## 1958 0.9954212 0.9522762 0.9469115 0.9383993 0.9715785 1.0261340 1.0483841
## 1959 0.9825176 0.9505736 0.9785278 0.9746440 1.0177637 0.9968613 1.0373136
## 1960 1.0039279 0.9590794 0.8940857 1.0064948 1.0173588 1.0120790        NA
##            Aug       Sep       Oct       Nov       Dec
## 1949 0.9534014 1.0022198 1.0040278 1.0062701 1.0118119
## 1950 0.9733720 1.0225047 0.9721928 0.9389527 1.0067914
## 1951 0.9397599 0.9888637 0.9938809 1.0235337 1.0250824
## 1952 0.9931171 0.9746302 1.0046687 1.0202797 1.0115407
## 1953 0.9894989 0.9934337 1.0192680 1.0009392 0.9915039
## 1954 0.9845184 0.9881036 0.9927613 0.9995143 0.9908692
## 1955 0.9831117 1.0032501 1.0003084 0.9827720 1.0125535
## 1956 1.0004765 1.0008730 0.9835071 0.9932761 0.9894251
## 1957 1.0312668 1.0236147 1.0108432 1.0212995 1.0005263
## 1958 1.0789695 0.9856540 0.9977971 0.9802940 0.9405687
## 1959 1.0531001 0.9974447 1.0013371 1.0134608 0.9999192
## 1960        NA        NA        NA        NA        NA
## 
## $figure
##  [1] 0.9102304 0.8836253 1.0073663 0.9759060 0.9813780 1.1127758 1.2265555
##  [8] 1.2199110 1.0604919 0.9217572 0.8011781 0.8988244
## 
## $type
## [1] "multiplicative"
## 
## attr(,"class")
## [1] "decomposed.ts"

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from itertools import combinations
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.tsa.arima.model import ARIMA as ARIMA
import statsmodels.api as sm
import statsmodels.tsa.api as smt
import matplotlib
from statsmodels.tsa.seasonal import seasonal_decompose
matplotlib.use('Qt5Agg')

# 'https://www.kaggle.com/rakannimer/air-passengers'

data=pd.read_csv('AirPassengers.csv')
data['Date'] = pd.to_datetime(data['Month'])
data = data.drop(columns = 'Month')
data = data.set_index('Date')
data = data.rename(columns = {'#Passengers':'Passengers'})
data.head()
##             Passengers
## Date                  
## 1949-01-01         112
## 1949-02-01         118
## 1949-03-01         132
## 1949-04-01         129
## 1949-05-01         121
result=seasonal_decompose(data['Passengers'], model='multiplicative', period=12)

p=result.plot()

p.savefig('decomposed_p.png')

Modelos de pronósticos en series de tiempo

Additive decomposition model

  • A simple additive decomposition model is given by: \[x_{t} = m_{t} + s_{t} + z_{t}\] where, at time t, \(x_{t}\) is the observed series, \(m_{t}\) is the trend, \(s_{t}\) is the seasonal effect, and \(z_{t}\) is an error term that is, in general, a sequence of correlated random variables with mean zero.

  • If the seasonal effect tends to increase as the trend increases, a multiplicative model may be more appropriate: \[x_{t} = m_{t} * s_{t} +z_{t}\]

  • If the random variation is modelled by a multiplicative factor and the variable is positive, an additive decomposition model for \(log(x_{t})\) [log means natural logarithm, sometimes written \(ln\)] can be used: \[log(x_{t}) = m_{t} + s_{t} + z_{t}\]

  • Some care is required when the exponential function is applied to the predicted mean of \(log(x_{t})\) to obtain a prediction for the mean value \(x_{t}\), as the effect is usually to bias the predictions. If the random series \(z_{t}\) are normally distributed with mean 0 and variance \(\sigma^{2}\), then the predicted mean value at time t based on the above equation is given by: \[\hat{x}_{t} = e^{m_{t}+s_{t}}e^{\frac{1}{2}\sigma^{2}}\] However, if the error series is not normally distributed and is negatively skewed (i.e. its probability has a long tail to the left) as is often the case after taking logarithms, the bias correction factor will be an overcorrection and it is preferable to apply an empirical adjustment.
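The bias-correction factor \(e^{\sigma^{2}/2}\) can be checked by simulation; a small sketch with illustrative values for the log-scale mean and standard deviation (not fitted to any real series):

```python
import numpy as np

rng = np.random.default_rng(42)
m, sigma = 5.0, 0.5          # illustrative mean and sd of log(x_t)

# Naive back-transform: exp of the predicted mean of log(x_t)
naive = np.exp(m)

# Bias-corrected prediction under normal errors: exp(m) * exp(sigma^2 / 2)
corrected = np.exp(m) * np.exp(0.5 * sigma**2)

# Compare against the empirical mean of a large lognormal sample
sample_mean = np.exp(rng.normal(m, sigma, size=1_000_000)).mean()
print(naive, corrected, sample_mean)
```

The naive back-transform underestimates the mean; the corrected value tracks the empirical lognormal mean closely.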

Estimating trends and seasonal effects

  • There are various ways to estimate the trend \(m_{t}\) at time \(t\), but a relatively simple procedure, which is available in R and does not assume any specific form, is to calculate a moving average centred on \(x_{t}\).

  • For monthly series, we need to average twelve consecutive months, but there is a slight snag. Suppose our time series begins in January (\(t = 1\)) and we average January up to December (\(t = 12\)). This average corresponds to a time \(t = 6.5\), between June and July. When we come to estimate seasonal effects, we need a moving average at integer times. This can be achieved by averaging the average of January up to December and the average of February (\(t = 2\)) up to January (\(t = 13\)). This average of two moving averages corresponds to \(t = 7\), and the process is called centring. Thus the trend at time t can be estimated by the centred moving average \[\hat{m}_{t} = \frac{\frac{1}{2}x_{t-6} + x_{t-5} + ... + x_{t-1} + x_{t} + x_{t+1} + ... + x_{t+5} + \frac{1}{2}x_{t+6}}{12}\] where \(t = 7, . . . , n − 6\). The coefficients in the above equation for each month are 1/12 (or sum to 1/12 in the case of the first and last coefficients, which correspond to the same calendar month), so that equal weight is given to each month and the coefficients sum to 1. By using the seasonal frequency for the coefficients in the moving average, the procedure generalises to any seasonal frequency (e.g., quarterly series), provided the condition that the coefficients sum to unity is still met.

  • An estimate of the monthly additive effect (\(s_{t}\)) at time \(t\) can be obtained by subtracting \(\hat{m}_{t}\): \[\hat{s}_{t} = x_{t} − \hat{m}_{t}\]

  • If the monthly effect is multiplicative, the estimate is given by division: \[\hat{s}_{t} = x_{t}/\hat{m}_{t}\]
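A minimal sketch of the centred moving average and the additive seasonal-effect estimate on a synthetic series (a linear trend plus a zero-mean period-12 pattern; all values are illustrative, not the AirPassengers data):

```python
import numpy as np

n = 48
t = np.arange(1, n + 1)
seasonal = np.tile(np.sin(2 * np.pi * np.arange(12) / 12), 4)  # zero-mean, period 12
x = 10 + 0.5 * t + seasonal                                    # additive toy series

# Centred 12-month moving average: half weight on the two end months
w = np.r_[0.5, np.ones(11), 0.5] / 12
m_hat = np.convolve(x, w, mode='valid')    # trend estimates for t = 7, ..., n - 6

# Additive seasonal estimate over the same span: s_hat_t = x_t - m_hat_t
s_hat = x[6:n - 6] - m_hat
```

Because the weights are symmetric and sum to one, the centred moving average recovers the linear trend exactly, and the residual `s_hat` recovers the seasonal pattern.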

Smoothing

The centred moving average is an example of a smoothing procedure that is applied retrospectively to a time series with the objective of identifying an underlying signal or trend.

Modelos estacionarios y no estacionarios

  • The mean function of a time series model is \[\mu(t) = E(x_{t})\] The expectation in this definition is an average taken across the ensemble of all the possible time series that might have been produced by the time series model.

  • The ensemble constitutes the entire population. If we have a time series model, we can simulate more than one time series.

  • If the mean function is constant, we say that the time series model is stationary in the mean, and the sample mean is an estimate of the population mean: \[\overline{x} = \sum_{t=1}^{n} x_{t}/n\] This relies on the assumption that a sufficiently long time series characterises the hypothetical model. Such models are known as ergodic: \[\displaystyle \lim_{n \to \infty} \frac{\sum_{t=1}^{n} x_{t}}{n} = \mu\]

  • It is straightforward to adapt a stationary time series model to be non-ergodic by defining the means for the individual time series to be from some probability distribution.

  • The variance function of a time series model that is stationary in the mean is \[\sigma^{2}(t) = E[(x_{t} - \mu)^{2}]\] which can, in principle, take a different value at every time \(t\). But we cannot estimate a different variance at each time point from a single time series. To progress, we must make the simplifying assumption that the variance is constant, so that \(\sigma^{2}\) can be estimated from the sample variance: \[Var(x) = \frac{\sum(x_{t} - \overline{x})^2}{n-1}\]

  • In a time series analysis, sequential observations may be correlated. If the correlation is positive, \(Var(x)\) will tend to underestimate the population variance in a short series because successive observations tend to be relatively similar. In most cases, this does not present a problem since the bias decreases rapidly as the length n of the series increases.

Autocorrelation

  • The mean and variance play an important role in the study of statistical distributions because they summarise two key distributional properties – a central location and the spread. Similarly, in the study of time series models, a key role is played by the second-order properties, which include the mean, variance, and serial correlation.

  • A correlation of a variable with itself at different times is known as autocorrelation or serial correlation. If a time series model is second-order stationary, we can define an autocovariance function (acvf), \(\gamma_{k}\), as a function of the lag \(k\): \[\gamma_{k} = E[(x_{t} - \mu)(x_{t+k}-\mu)]\]

  • The function \(\gamma_{k}\) does not depend on \(t\) because the expectation, which is across the ensemble, is the same at all times \(t\). The definition follows naturally from that of the covariance by replacing \(x\) with \(x_{t}\) and \(y\) with \(x_{t+k}\) and noting that the mean \(\mu\) is the mean of both \(x_{t}\) and \(x_{t+k}\). The lag \(k\) autocorrelation function \((acf)\), \(\rho_{k}\), is defined by: \[\rho_{k} = \frac{\gamma_{k}}{\sigma^{2}}\]

  • The acvf and acf can be estimated from a time series by their sample equivalents. The sample acvf, \(c_{k}\), is calculated as \[c_{k} = \frac{1}{n} \sum_{t = 1}^{n-k} (x_{t}-\overline{x})(x_{t+k} - \overline{x})\]

  • The sample acf is defined as \[r_{k} = \frac{c_{k}}{c_{0}}\]

AirPassengers %>% 
  as_tsibble %>% 
  ggplot(aes(index,value)) +
  geom_line(colour='#00B2A8') +
  theme_ipsum()

# autocorrelation
acf(AirPassengers)$acf
## , , 1
## 
##            [,1]
##  [1,] 1.0000000
##  [2,] 0.9480473
##  [3,] 0.8755748
##  [4,] 0.8066812
##  [5,] 0.7526254
##  [6,] 0.7137700
##  [7,] 0.6817336
##  [8,] 0.6629044
##  [9,] 0.6556105
## [10,] 0.6709483
## [11,] 0.7027199
## [12,] 0.7432402
## [13,] 0.7603950
## [14,] 0.7126609
## [15,] 0.6463423
## [16,] 0.5859234
## [17,] 0.5379552
## [18,] 0.4997475
## [19,] 0.4687340
## [20,] 0.4498707
## [21,] 0.4416288
## [22,] 0.4572238
# autocovariance
acf(AirPassengers, type = c("covariance"))$acf
## , , 1
## 
##            [,1]
##  [1,] 14291.973
##  [2,] 13549.467
##  [3,] 12513.692
##  [4,] 11529.066
##  [5,] 10756.502
##  [6,] 10201.181
##  [7,]  9743.318
##  [8,]  9474.212
##  [9,]  9369.968
## [10,]  9589.176
## [11,] 10043.254
## [12,] 10622.369
## [13,] 10867.546
## [14,] 10185.330
## [15,]  9237.507
## [16,]  8374.002
## [17,]  7688.441
## [18,]  7142.378
## [19,]  6699.134
## [20,]  6429.540
## [21,]  6311.747
## [22,]  6534.630

  • The x-axis gives the lag \((k)\) and the y-axis gives the autocorrelation \((r_{k})\) at each lag. The unit of lag is the sampling interval (one month for the AirPassengers series). Correlation is dimensionless, so there is no unit for the y-axis.

  • If \(\rho_{k} = 0\), the sampling distribution of \(r_{k}\) is approximately normal, with a mean of \(\frac{−1}{n}\) and a variance of \(\frac{1}{n}\). The dotted lines on the correlogram are drawn at \[-\frac{1}{n}\pm\frac{2}{\sqrt{n}}\]

  • The main use of the correlogram is to detect autocorrelations in the time series after we have removed an estimate of the trend and seasonal variation.
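As a quick sketch, the dotted-line bounds \(-1/n \pm 2/\sqrt{n}\) can be computed directly; here n = 144, the length of the AirPassengers series:

```python
import numpy as np

n = 144                       # length of the AirPassengers series
center = -1 / n               # approximate mean of r_k under rho_k = 0
half_width = 2 / np.sqrt(n)   # two standard errors, since var(r_k) ~ 1/n
lower, upper = center - half_width, center + half_width
print(lower, upper)
```

Sample autocorrelations outside these bounds are suggestive of genuine serial correlation rather than sampling noise.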

AP_decomposed <- decompose(AirPassengers, "multiplicative")
plot(ts(AP_decomposed$random[7:138]))  

acf(AP_decomposed$random[7:138])

  • The correlogram above suggests either a damped cosine shape that is characteristic of an autoregressive model of order 2 (Chapter 4) or that the seasonal adjustment has not been entirely effective. The latter explanation is unlikely because the decomposition does estimate twelve independent monthly indices. If we investigate further, we see that the standard deviation of the original series from July until June is 109, the standard deviation of the series after subtracting the trend estimate is 41, and the standard deviation after seasonal adjustment is just 0.03.
sd(AP[7:138])
## [1] 109.4187
sd(AP[7:138] - AP_decomposed$trend[7:138])
## [1] 41.11491
sd(AP_decomposed$random[7:138])
## [1] 0.0333884

ARIMA

As the differenced series needs to be aggregated (or ‘integrated’) to recover the original series, the underlying stochastic process is called ARIMA (Auto Regressive Integrated Moving Average). The ARIMA process can be extended to include seasonal terms, giving a non-stationary seasonal ARIMA (SARIMA) process. Seasonal ARIMA models are powerful tools in the analysis of time series as they are capable of modelling a very wide range of series. Much of the methodology was pioneered by Box and Jenkins in the 1970s.

Series may also be non-stationary because the variance is serially correlated (technically known as conditionally heteroskedastic), which usually results in periods of volatility, where there is a clear change in variance.

  • AR (p): Auto Regressive. Number of autoregressive terms.

  • I (d): Integrated. Number of differencing orders required to make the time series stationary.

  • MA (q): Moving Average. Number of lagged forecast errors in the prediction equation.

  • p, d, and q are known as the order of the ARIMA model.

Selection criteria for the order of ARIMA model:

p: Lag value where the Partial Autocorrelation (PACF) graph cuts off or drops to 0 for the first time. In general, the “partial” correlation between two variables is the amount of correlation between them which is not explained by their mutual correlations with a specified set of other variables. For example, if we are regressing a variable Y on other variables X1, X2, and X3, the partial correlation between Y and X3 is the amount of correlation between Y and X3 that is not explained by their common correlations with X1 and X2. This partial correlation can be computed as the square root of the reduction in variance that is achieved by adding X3 to the regression of Y on X1 and X2.

d: Number of times differencing is carried out to make the time series stationary.

q: Lag value where the Autocorrelation (ACF) graph crosses the upper confidence interval for the first time. A partial autocorrelation is the amount of correlation between a variable and a lag of itself that is not explained by correlations at all lower-order lags. The autocorrelation of a time series Y at lag 1 is the coefficient of correlation between Yt and Yt-1, which is presumably also the correlation between Yt-1 and Yt-2. But if Yt is correlated with Yt-1, and Yt-1 is equally correlated with Yt-2, then we should also expect to find correlation between Yt and Yt-2. In fact, the amount of correlation we should expect at lag 2 is precisely the square of the lag-1 correlation. Thus, the correlation at lag 1 “propagates” to lag 2 and presumably to higher-order lags. The partial autocorrelation at lag 2 is therefore the difference between the actual correlation at lag 2 and the expected correlation due to the propagation of correlation at lag 1.
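The “propagation” argument can be checked against the theoretical ACF of an AR(1) process, for which the lag-2 correlation is exactly the square of the lag-1 correlation, so the lag-2 partial autocorrelation is zero. The coefficient 0.7 is illustrative:

```python
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess

# AR(1) with phi = 0.7: rho_1 = 0.7, rho_2 = 0.7**2, so nothing is left
# at lag 2 once the lag-1 propagation is accounted for
phi = 0.7
proc = ArmaProcess(ar=[1, -phi], ma=[1])
rho = proc.acf(lags=3)        # theoretical acf at lags 0, 1, 2
print(rho)
```
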

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from itertools import combinations
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.tsa.arima.model import ARIMA as ARIMA
import statsmodels.api as sm
import statsmodels.tsa.api as smt
import matplotlib
from statsmodels.tsa.seasonal import seasonal_decompose
matplotlib.use('Qt5Agg')

# 'https://www.kaggle.com/rakannimer/air-passengers'

data=pd.read_csv('AirPassengers.csv')
data['Date'] = pd.to_datetime(data['Month'])
data = data.drop(columns = 'Month')
data = data.set_index('Date')
data = data.rename(columns = {'#Passengers':'Passengers'})
data.head()
##             Passengers
## Date                  
## 1949-01-01         112
## 1949-02-01         118
## 1949-03-01         132
## 1949-04-01         129
## 1949-05-01         121
result=seasonal_decompose(data['Passengers'], model='multiplicative', period=12)

p=result.plot()

p.savefig('decomposed_p.png')

  • Linear uptrend.
  • Yearly seasonal pattern.

Testing Stationarity

  • Augmented Dickey Fuller Test
  • Null Hypothesis: the time series is non-stationary (it has a unit root).
  • Alternate Hypothesis: the time series is stationary; we conclude this if the null hypothesis is rejected.

The output of the Augmented Dickey Fuller Test includes:

  • Test Statistic
  • p-value
  • #Lags Used
  • Number of Observations Used
  • Critical Value (1%)
  • Critical Value (5%)
  • Critical Value (10%)

For the null hypothesis to be rejected there are 2 requirements:

  • Critical Value (5%) > Test Statistic
  • p-value < 0.05
def test_stationarity(timeseries):
    # Determine rolling statistics
    MA = timeseries.rolling(window=12).mean()
    MSTD = timeseries.rolling(window=12).std()

    #Plot rolling statistics:
    plt.figure(figsize=(15,5))
    orig = plt.plot(timeseries, color='blue',label='Original')
    mean = plt.plot(MA, color='red', label='Rolling Mean')
    std = plt.plot(MSTD, color='black', label = 'Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)

    #Perform Dickey-Fuller test:
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print(dfoutput)

test_stationarity(data['Passengers'])
## Results of Dickey-Fuller Test:
## Test Statistic                   0.815369
## p-value                          0.991880
## #Lags Used                      13.000000
## Number of Observations Used    130.000000
## Critical Value (1%)             -3.481682
## Critical Value (5%)             -2.884042
## Critical Value (10%)            -2.578770
## dtype: float64

  • Test Statistic: 0.82 > Critical Value (5%): -2.88
  • p-value 0.99 > 0.05

Hence the null hypothesis cannot be rejected and we can conclude that the time series is non-stationary.

In order to remove the trend and seasonality and make the time series stationary, we will use differencing, i.e. subtracting each value from the one that follows it, together with a log transform. We must keep track of how many differencing passes are applied, as this is d.
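A toy illustration of why d counts differencing passes: one pass removes a linear trend, two passes remove a quadratic one (synthetic arrays, not the passengers data):

```python
import numpy as np

# Linear trend: first differences are constant, so d = 1 is enough
trend = 5 + 2 * np.arange(10)
d1 = np.diff(trend)
print(d1)          # constant differences

# Quadratic trend: two differencing passes (d = 2) flatten it
quad = np.arange(10) ** 2
d2 = np.diff(quad, n=2)
print(d2)          # constant second differences
```
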


# Original Series
fig, (ax1, ax2, ax3) = plt.subplots(3)
ax1.plot(data.Passengers); ax1.set_title('Original Series'); ax1.axes.xaxis.set_visible(False)
# 1st Differencing
ax2.plot(data.Passengers.diff()); ax2.set_title('1st Order Differencing'); ax2.axes.xaxis.set_visible(False)
# 2nd Differencing
ax3.plot(data.Passengers.diff().diff()); ax3.set_title('2nd Order Differencing')
plt.show()

from statsmodels.graphics.tsaplots import plot_acf
fig, (ax1, ax2, ax3) = plt.subplots(3)
plot_acf(data.Passengers, ax=ax1)
plot_acf(data.Passengers.diff().dropna(), ax=ax2)
plot_acf(data.Passengers.diff().diff().dropna(), ax=ax3)
plt.show()

Here we can see that the time series has become ‘almost’ stationary. Notice that the first-order difference contains relatively little noise, while differencing a second time visibly increases the noise.

In the second-order difference, the autocorrelation at the first lag swings well into the negative side, a sign that the series has been over-differenced.

So we can select 1st order differencing for our model. We will also log transform the data.

# Work on the log scale from here on; the later ARIMA fit also uses the log series
data['Passengers'] = np.log(data['Passengers'])
data_diff = data.diff().dropna()
dec = sm.tsa.seasonal_decompose(data_diff,period = 12).plot()
plt.show()

Then we can check again for stationarity with Dickey Fuller test:

test_stationarity(data_diff)
## Results of Dickey-Fuller Test:
## Test Statistic                  -2.717131
## p-value                          0.071121
## #Lags Used                      14.000000
## Number of Observations Used    128.000000
## Critical Value (1%)             -3.482501
## Critical Value (5%)             -2.884398
## Critical Value (10%)            -2.578960
## dtype: float64

From the outputs of the Augmented Dickey Fuller Test:

  • Rolling Mean is close to 0 – ok
  • Rolling Standard Deviation is close to 0.1 – ok
  • Test Statistic: -2.72 > Critical Value (5%): -2.88 – not!
  • p-value: 0.07 > 0.05 – not!

The time series is still strictly speaking non-stationary, but we will proceed as though it were stationary.

def tsplot(y, lags=None, figsize=(12, 7), style='bmh'):
    if not isinstance(y, pd.Series):
        y = pd.Series(y)
        
    with plt.style.context(style):    
        fig = plt.figure(figsize=figsize)
        layout = (2, 2)
        ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)
        acf_ax = plt.subplot2grid(layout, (1, 0))
        pacf_ax = plt.subplot2grid(layout, (1, 1))
        
        y.plot(ax=ts_ax)
        p_value = sm.tsa.stattools.adfuller(y)[1]
        ts_ax.set_title('Time Series Analysis Plots\n Dickey-Fuller: p={0:.5f}'.format(p_value))
        smt.graphics.plot_acf(y, lags=lags, ax=acf_ax)
        smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax,method='ywm')
        plt.tight_layout()
        plt.show()

tsplot(data_diff['Passengers'])

To remind ourselves:

  • p is the number of autoregressive terms. Draw a partial autocorrelation graph (PACF) of the data. This will help us find the value of p because the cut-off point of the PACF is p. Using the PACF plot we can take the order of AR terms to be equal to the number of lags that cross the significance limit.

  • d is the number of nonseasonal differences. If the time series is stationary, try to fit the ARMA model; if it is non-stationary, determine the value of d. We used one differencing term (see diff()).

  • q is the number of lagged forecast errors in the prediction equation. Draw an autocorrelation graph (ACF) of the data. It tells us how many moving average terms are required to remove the autocorrelation from the stationary time series: the number of lags at which the ACF crosses the significance limit suggests q.

From the above plots, the following order of ARIMA model is selected from the selection criteria mentioned above:

  • p: 2
  • d: 1
  • q: 2

Thus, we select p and q as above and pass them, together with d, to the ARIMA model, which carries out the differencing itself and makes the time series stationary.


import warnings
warnings.filterwarnings('ignore')

model = ARIMA(
  data['Passengers'],
  order = (2,1,2)
  )

model_fit = model.fit()

print(model_fit.summary())
##                                SARIMAX Results                                
## ==============================================================================
## Dep. Variable:             Passengers   No. Observations:                  144
## Model:                 ARIMA(2, 1, 2)   Log Likelihood                 128.890
## Date:                Wed, 15 Feb 2023   AIC                           -247.780
## Time:                        13:55:02   BIC                           -232.965
## Sample:                    01-01-1949   HQIC                          -241.760
##                          - 12-01-1960                                         
## Covariance Type:                  opg                                         
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## ar.L1          0.2431      0.166      1.469      0.142      -0.081       0.568
## ar.L2          0.2677      0.199      1.344      0.179      -0.123       0.658
## ma.L1         -0.0867      0.118     -0.732      0.464      -0.319       0.145
## ma.L2         -0.6813      0.142     -4.786      0.000      -0.960      -0.402
## sigma2         0.0096      0.002      5.730      0.000       0.006       0.013
## ===================================================================================
## Ljung-Box (L1) (Q):                   0.01   Jarque-Bera (JB):                 7.04
## Prob(Q):                              0.94   Prob(JB):                         0.03
## Heteroskedasticity (H):               1.08   Skew:                            -0.06
## Prob(H) (two-sided):                  0.79   Kurtosis:                         1.92
## ===================================================================================
## 
## Warnings:
## [1] Covariance matrix calculated using the outer product of gradients (complex-step).

In-sample Forecasting

  • In in-sample forecasting, the model forecasts values for the existing data points of the time series. It is similar to a train/test data split.

  • We divide the data into train and test datasets. We reserve the last 30 elements for the test set and the remaining for the train set.

  • For this in-sample forecasting, we use the rolling forecast method, i.e., we forecast one value at a time, append the actual observation to the history, and refit the model before predicting the next value.

import warnings
warnings.filterwarnings('ignore')

size = int(len(data) - 30)
train, test = data['Passengers'][0:size], data['Passengers'][size:len(data)]

print('\t ARIMA MODEL : In- Sample Forecasting \n')
##   ARIMA MODEL : In- Sample Forecasting
history = [x for x in train]
predictions = []

for t in range(len(test)):
    
    model = ARIMA(history, order=(2,1,2))
    model_fit = model.fit()
    
    output = model_fit.forecast()
    yhat = output[0]
    predictions.append(float(yhat))
    
    obs = test[t]
    history.append(obs)
    
    print('predicted = %f, expected = %f' % (yhat, obs))
    
## predicted = 6.061629, expected = 6.196444
## predicted = 6.168390, expected = 6.224558
## predicted = 6.173450, expected = 6.001415
## predicted = 5.930065, expected = 5.883322
## predicted = 5.914084, expected = 5.736572
## predicted = 5.716254, expected = 5.820083
## predicted = 5.909268, expected = 5.886104
## predicted = 5.861412, expected = 5.834811
## predicted = 5.855939, expected = 6.006353
## predicted = 6.032087, expected = 5.981414
## predicted = 5.929348, expected = 6.040255
## predicted = 6.069370, expected = 6.156979
## predicted = 6.115611, expected = 6.306275
## predicted = 6.304535, expected = 6.326149
## predicted = 6.246165, expected = 6.137727
## predicted = 6.096042, expected = 6.008813
## predicted = 6.002707, expected = 5.891644
## predicted = 5.893924, expected = 6.003887
## predicted = 6.062999, expected = 6.033086
## predicted = 5.998166, expected = 5.968708
## predicted = 5.987891, expected = 6.037871
## predicted = 6.049423, expected = 6.133398
## predicted = 6.133437, expected = 6.156979
## predicted = 6.130192, expected = 6.282267
## predicted = 6.290666, expected = 6.432940
## predicted = 6.390774, expected = 6.406880
## predicted = 6.349148, expected = 6.230481
## predicted = 6.177135, expected = 6.133398
## predicted = 6.143185, expected = 5.966147
## predicted = 5.940730, expected = 6.068426
predictions_series = pd.Series(predictions, index = test.index)
fig,ax = plt.subplots(nrows = 1,ncols = 1,figsize = (15,5))
plt.subplot(1,1,1)
plt.plot(data['Passengers'],label = 'Expected Values')
plt.plot(predictions_series, label = 'Predicted Values');
plt.legend(loc="upper left")
plt.show()

error = np.sqrt(mean_squared_error(test,predictions))
print('Test RMSE: %.4f' % error)
## Test RMSE: 0.1067
  • The test RMSE value is relatively high.

  • Unmodelled seasonality and the chosen order of the ARIMA model likely affected performance.


Out-of-sample Forecasting
  • The model forecasts values for future data points by creating a datetime index that extends the time series.

  • We create a new data frame with the future index values and the same columns as the data frame used for model fitting.

  • We use the rolling method with both the forecast and the predict functions.

  • In the rolling method, we forecast the next single value and feed this prediction back into the model fitting to predict the following value.

from pandas.tseries.offsets import DateOffset
import datetime

# NOTE: DateOffset(weeks = x) yields a weekly-spaced future index even though
# the series is monthly (see the dates in the outputs below)
future_dates = [data.index[-1] + DateOffset(weeks = x) for x in range(0,49)]

# New dataframe for storing the future values
df1 = pd.DataFrame(index = future_dates[1:], columns = data.columns)

forecast = pd.concat([data, df1])

forecast['ARIMA_Forecast_Function'] = np.nan  # np.nan, since np.NaN was removed in NumPy 2.0

forecast['ARIMA_Predict_Function'] = np.nan

forecast.head()
##             Passengers  ARIMA_Forecast_Function  ARIMA_Predict_Function
## 1949-01-01    4.718499                      NaN                     NaN
## 1949-02-01    4.770685                      NaN                     NaN
## 1949-03-01    4.882802                      NaN                     NaN
## 1949-04-01    4.859812                      NaN                     NaN
## 1949-05-01    4.795791                      NaN                     NaN
ARIMA_history_f = [x for x in train]

f1 = []

for t in range(len(df1)):
    
    model = ARIMA(ARIMA_history_f, order = (2,1,2))
    model_fit = model.fit()
    
    output = model_fit.forecast()[0]
    
    ARIMA_history_f.append(output)
    f1.append(output)
    
for i in range(len(f1)):

    forecast.iloc[144 + i, 1] = f1[i]

forecast.tail()
##             Passengers  ARIMA_Forecast_Function  ARIMA_Predict_Function
## 1961-10-05         NaN                 5.957435                     NaN
## 1961-10-12         NaN                 5.957435                     NaN
## 1961-10-19         NaN                 5.957435                     NaN
## 1961-10-26         NaN                 5.957435                     NaN
## 1961-11-02         NaN                 5.957435                     NaN
predictions_series = pd.Series(predictions, index = test.index)

fig,ax = plt.subplots(nrows = 1,ncols = 1,figsize = (12,8))
plt.subplot(1,1,1)
plt.plot(forecast[['Passengers','ARIMA_Forecast_Function']], label = ['Observed Values','Forecast Values'])
plt.legend(loc="upper left")
plt.show()

Predict Function

ARIMA_history_p = [x for x in train]
f2 = []

for t in range(len(df1)):
    
    model = ARIMA(ARIMA_history_p, order = (2,1,2))
    
    model_fit = model.fit()
    
    output = model_fit.predict(start = len(ARIMA_history_p),
                               end = len(ARIMA_history_p),
                               typ = 'levels')[0]
    
    ARIMA_history_p.append(output)
    f2.append(output)
    
for i in range(len(f2)):
    forecast.iloc[144 + i,2] = f2[i]
forecast.tail()
##             Passengers  ARIMA_Forecast_Function  ARIMA_Predict_Function
## 1961-10-05         NaN                 5.957435                5.957435
## 1961-10-12         NaN                 5.957435                5.957435
## 1961-10-19         NaN                 5.957435                5.957435
## 1961-10-26         NaN                 5.957435                5.957435
## 1961-11-02         NaN                 5.957435                5.957435
predictions_series = pd.Series(predictions, index = test.index)

fig,ax = plt.subplots(nrows = 1,ncols = 1,figsize = (12,8))
plt.subplot(1,1,1)
plt.plot(forecast[['Passengers','ARIMA_Predict_Function']], label = ['Observed Values','Predicted Values'])
plt.legend(loc="upper left")
plt.show()

  • For the above trained ARIMA model, the values generated by the forecast and predict functions are either exactly identical or differ only in the last few decimal places!

  • The model clearly did not capture the seasonal patterns of the data.

SARIMA

  • Seasonal Auto Regressive Integrated Moving Average (SARIMA) is an extension of the ARIMA model that can handle the seasonal effects of the data. It has two orders: (p,d,q) and (P,D,Q,M).

  • (p,d,q) is an order that is similar to the order of the ARIMA model.

  • (P,D,Q,M) is known as the Seasonal Order, where P,D,Q are the seasonal counterparts of p,d,q.

  • To select M, we need to difference the data by the periodicity of the seasonal terms M and then check the PACF & ACF values at the Mth lag value.


data_diff_seas = data_diff.diff(12)
data_diff_seas = data_diff_seas.dropna()
dec = sm.tsa.seasonal_decompose(data_diff_seas,period = 12)
dec.plot()
plt.show()

  • Our data is in monthly format and the seasonal period is of 1 year.

  • Hence, we difference the already differenced data by a periodicity, M, value of 12.

  • The seasonality of the data has not completely died down, but its amplitude has been reduced.

  • We will check this seasonal differenced data for stationarity.


test_stationarity(data_diff_seas['Passengers'])
## Results of Dickey-Fuller Test:
## Test Statistic                  -4.443325
## p-value                          0.000249
## #Lags Used                      12.000000
## Number of Observations Used    118.000000
## Critical Value (1%)             -3.487022
## Critical Value (5%)             -2.886363
## Critical Value (10%)            -2.580009
## dtype: float64

From the stationarity check and the Augmented Dickey-Fuller test:

  • The rolling mean is very close to 0.

  • The rolling standard deviation is almost constant, with occasional crests.

  • Test Statistic: -4.443325 < Critical Value (1%): -3.487022. We can say that the time series is stationary with 99% confidence, as the Test Statistic is below the 1% Critical Value as well.

  • p-value 0.000249 < 0.05.

  • We can reject the null hypothesis, i.e., the above time series is stationary.


tsplot(data_diff_seas['Passengers'])

  • p,d,q retain the same values as the ones used in ARIMA: (2,1,2).

  • Seasonal order P,D,Q,M: P : 0, D : 1, Q : 1, M : 12 (matching the seasonal_order passed to the model below).

model = sm.tsa.statespace.SARIMAX(data['Passengers'],order = (2,1,2),seasonal_order = (0,1,1,12))
model_fit = model.fit()
print(model_fit.summary())
##                                       SARIMAX Results                                       
## ============================================================================================
## Dep. Variable:                           Passengers   No. Observations:                  144
## Model:             SARIMAX(2, 1, 2)x(0, 1, [1], 12)   Log Likelihood                 245.755
## Date:                              Wed, 15 Feb 2023   AIC                           -479.511
## Time:                                      13:55:48   BIC                           -462.260
## Sample:                                  01-01-1949   HQIC                          -472.501
##                                        - 12-01-1960                                         
## Covariance Type:                                opg                                         
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## ar.L1         -0.0322      0.974     -0.033      0.974      -1.940       1.876
## ar.L2          0.2451      0.255      0.962      0.336      -0.254       0.745
## ma.L1         -0.3644      0.998     -0.365      0.715      -2.321       1.592
## ma.L2         -0.2540      0.571     -0.445      0.656      -1.373       0.865
## ma.S.L12      -0.5666      0.112     -5.077      0.000      -0.785      -0.348
## sigma2         0.0013      0.000      8.249      0.000       0.001       0.002
## ===================================================================================
## Ljung-Box (L1) (Q):                   0.00   Jarque-Bera (JB):                 3.78
## Prob(Q):                              0.95   Prob(JB):                         0.15
## Heteroskedasticity (H):               0.61   Skew:                            -0.03
## Prob(H) (two-sided):                  0.10   Kurtosis:                         3.83
## ===================================================================================
## 
## Warnings:
## [1] Covariance matrix calculated using the outer product of gradients (complex-step).

In-sample Forecasting

In-sample forecasting: the model forecasts values for the existing data points of the time series. It is similar to the train/test format for regression or classification problems.

We divide the data into train and test datasets. As in the ARIMA approach, we reserve the last 30 elements for the test set and the remaining for the train set.

For this in-sample forecasting, we use the rolling forecast method, i.e., we forecast one value at a time and refit the model on the history extended with each new observation before predicting the next value.

error = np.sqrt(mean_squared_error(test,predictions))
print('Test RMSE: %.4f' % error)
## Test RMSE: 0.0368

Out-of-Sample Forecasting
  • Out-of-Sample Forecasting: Model forecasts values for the future data points by generating the datetime index values of the time series.

  • We create a new dataframe with the future index values and same columns as the data frame that we use for model fitting.

  • As with ARIMA, we use the rolling method with the forecast and predict functions.

  • In the rolling method, we forecast the next single value and feed this prediction back into the model fitting to predict the following value. We store these future values in 2 new columns, SARIMA_Forecast_Function and SARIMA_Predict_Function, in the existing forecast dataframe.

forecast['SARIMA_Forecast_Function'] = np.nan  # np.nan, since np.NaN was removed in NumPy 2.0
forecast['SARIMA_Predict_Function'] = np.nan
forecast.head()
##             Passengers  ...  SARIMA_Predict_Function
## 1949-01-01    4.718499  ...                      NaN
## 1949-02-01    4.770685  ...                      NaN
## 1949-03-01    4.882802  ...                      NaN
## 1949-04-01    4.859812  ...                      NaN
## 1949-05-01    4.795791  ...                      NaN
## 
## [5 rows x 5 columns]

SARIMA_history_f = [x for x in train]

f3 = []

for t in range(len(df1)):
    
    model = sm.tsa.statespace.SARIMAX(SARIMA_history_f,order = (2,1,2),seasonal_order = (0,1,1,12))
    model_fit = model.fit()
    
    output = model_fit.forecast()[0]
    
    SARIMA_history_f.append(output)
    f3.append(output)
    
for i in range(len(f3)):
    forecast.iloc[144 + i,3] = f3[i]

forecast.tail()
##             Passengers  ...  SARIMA_Predict_Function
## 1961-10-05         NaN  ...                      NaN
## 1961-10-12         NaN  ...                      NaN
## 1961-10-19         NaN  ...                      NaN
## 1961-10-26         NaN  ...                      NaN
## 1961-11-02         NaN  ...                      NaN
## 
## [5 rows x 5 columns]

fig,ax = plt.subplots(nrows = 1,ncols = 1,figsize = (12,8))
plt.subplot(1,1,1)
plt.plot(forecast[['Passengers','SARIMA_Forecast_Function']], label = ['Observed Values','Forecast Values'])
plt.legend(loc="upper left")
plt.show()

From the above graph, we can say that SARIMA learned the trend and the seasonal pattern, but it did not capture the increasing width of the seasonal swings that is the characteristic feature of a multiplicative model.

SARIMA_history_p = [x for x in train]
f4 = []

for t in range(len(df1)):
    
    model = sm.tsa.statespace.SARIMAX(SARIMA_history_p,order = (2,1,2),seasonal_order = (0,1,1,12))
    model_fit = model.fit()
    
    output = model_fit.predict(start = len(SARIMA_history_p),end = len(SARIMA_history_p),typ = 'levels')[0]
    
    SARIMA_history_p.append(output)
    f4.append(output)
    
for i in range(len(f4)):
    forecast.iloc[144 + i,4] = f4[i]
forecast.tail()
##             Passengers  ...  SARIMA_Predict_Function
## 1961-10-05         NaN  ...                 6.144794
## 1961-10-12         NaN  ...                 6.290488
## 1961-10-19         NaN  ...                 6.263320
## 1961-10-26         NaN  ...                 6.288518
## 1961-11-02         NaN  ...                 6.453430
## 
## [5 rows x 5 columns]

fig,ax = plt.subplots(nrows = 1,ncols = 1,figsize = (12,8))
plt.subplot(1,1,1)
plt.plot(forecast[['Passengers','SARIMA_Predict_Function']], label = ['Observed Values','Predicted Values'])
plt.legend(loc="upper left")
plt.show()